In [1]:
__author__ = 'Alice Jacques <alice.jacques@noirlab.edu>, NOIRLab Astro Data Lab Team <datalab@noirlab.edu>' 
__version__ = '20210831' #yyyymmdd 
__datasets__ = ['ls_dr8','sdss_dr16','gaia_dr2','des_dr1','smash_dr2','unwise_dr1','allwise','nsc_dr2'] 
__keywords__ = ['crossmatch','joint query','mydb','vospace','image cutout']

How to use the pre-crossmatched tables at Astro Data Lab

by Alice Jacques and the NOIRLab Astro Data Lab Team

Goals

  • Learn how to use a pre-crossmatched table to do a joint query on two Data Lab data sets
  • Learn how to do an efficient crossmatch of a user-provided data table against a Data Lab pre-crossmatched table

Summary

Crossmatch table naming template

The crossmatch tables at Astro Data Lab are named as follows:

schema1.xNpN__table1__schema2__table2

where the N in NpN encode the numerical value of the crossmatching radius (since dots '.' are not allowed in table names).

Example:

ls_dr8.x1p5__tractor_primary__nsc_dr2__object

is a crossmatch table (indicated by the leading x), located in the ls_dr8 schema, and it crossmatches the ls_dr8.tractor_primary table with the nsc_dr2.object table (which lives in the nsc_dr2 schema) within a 1.5 arcseconds radius ('1p5') .

This is admittedly long, but clean, consistent, and most importantly, parsable. The use of double-underscores '__' is to distinguish from single underscores often used in schema and table names.

Columns in crossmatch tables

All crossmatch tables shall be minimalist, i.e. have only these columns: id1,ra1,dec1,id2,ra2,dec2,distance. Column descriptions in the crossmatch table shall contain the original column names in parentheses (makes it parsable).

For example:

ls_dr8.x1p5__tractor_primary__nsc_dr2__object

Column Description Datatype
id1 ID in left/first table (ls_id) BIGINT
ra1 Right ascension in left/first table (ra) DOUBLE
dec1 Declination in left/first table (dec) DOUBLE
id2 ID in right/second table (id) VARCHAR
ra2 Right ascension in right/second table (ra) DOUBLE
dec2 Declination in right/second table (dec) DOUBLE
distance Distance between ra1,dec1 and ra2,dec2 (arcsec) DOUBLE

Datatypes in crossmatch tables

  • The column data types in a crossmatch table for columns id1 and id2 shall be retained from the mother tables. The example above, BIGINT, is valid in many cases, but need not be for all data sets.
  • The data types for columns ra1, dec1, ra2, dec2 shall be DOUBLE, which they usually will be anyway.
  • The column distance can be either REAL or DOUBLE.

Overview

  • The following 5 data sets are considered the main reference tables and are crossmatched against all data sets (if there is sky overlap) and when a new data set is ingested:
    • latest gaia_drN.gaia_source
    • latest nsc_drN.object
    • latest unwise_drN.object
    • allwise.source
    • latest sdss_drN.specobj
  • "Crossmatch" means for now "single nearest neighbor" (and this is the current mode at Data Lab).
  • Object tables only, not single epoch measurements or metadata tables.
  • For every crossmatch table with table1 as the left/first table and table2 as the right/second table, there exists a corresponding crossmatch table with table2 as the left/first table and table1 as the right/second table.
    • For example, allwise.x1p5__source__des_dr2__main and des_dr2.x1p5__main__allwise__source.

The list of available crossmatch tables can be viewed on our query interface here under their respective schema.

Disclaimer & attribution

If you use this notebook for your published science, please acknowledge the following:

Imports and setup

In [2]:
# std lib
from getpass import getpass

# 3rd party
from astropy.utils.data import download_file  #import file from URL
from matplotlib.ticker import NullFormatter
import pylab as plt
import matplotlib
%matplotlib inline

# Data Lab
from dl import authClient as ac, queryClient as qc, storeClient as sc
from dl.helpers.utils import convert # converts table to Pandas dataframe object

Authentication

Much of the functionality of Data Lab can be accessed without explicitly logging in (the service then uses an anonymous login). But some capacities, for instance saving the results of your queries to your virtual storage space, require a login (i.e. you will need a registered user account).

If you need to log in to Data Lab, issue this command, and respond according to the instructions:

In [3]:
#ac.login(input("Enter user name: (+ENTER) "),getpass("Enter password: (+ENTER) "))
ac.whoAmI()
Out[3]:
'demo00'

Accessing the pre-crossmatched tables

We can use Data Lab's Query Client to access the pre-crossmatched tables hosted by Data Lab. First let's get a total count of the number of objects (nrows) in SDSS DR16 that are also in LS DR8:

In [4]:
%%time
query="SELECT nrows FROM tbl_stat WHERE schema='sdss_dr16' and tbl_name='x1p5__specobj__ls_dr8__tractor_primary'"

# Call query manager
response = qc.query(sql=query)
print(response)
nrows
4542898

CPU times: user 28.6 ms, sys: 3.95 ms, total: 32.6 ms
Wall time: 80.8 ms

Now let's print just the first 100 rows:

In [5]:
query = "SELECT * FROM sdss_dr16.x1p5__specobj__ls_dr8__tractor_primary LIMIT 100"
response = qc.query(sql=query)
result = convert(response) # convert the table into a Pandas dataframe object
result
Out[5]:
id1 ra1 dec1 id2 ra2 dec2 distance
0 3384465917919389696 287.22826 48.064735 8797230351783516 287.228165 48.064735 0.000063
1 3384466192797296640 287.44889 48.229698 8797230414957399 287.448870 48.229697 0.000014
2 3384462344506599424 287.38750 48.168965 8797230414890143 287.387517 48.168933 0.000034
3 3384463718896134144 287.69779 48.382804 8797230477803600 287.697861 48.382752 0.000070
4 3384465093285668864 287.54718 48.407654 8797230477804882 287.547174 48.407548 0.000106
... ... ... ... ... ... ... ...
95 3384471690355435520 287.70990 48.888661 8797230602453456 287.709937 48.888637 0.000034
96 3384469491332179968 287.66389 48.944252 8797230602454731 287.663800 48.944491 0.000247
97 3384480486448457728 287.22115 48.827232 8797230540199804 287.221105 48.827183 0.000057
98 3384477737669388288 287.29420 48.927487 8797230602388155 287.294186 48.927487 0.000009
99 3384470590843807744 287.46812 49.027895 8797230602391658 287.468139 49.027900 0.000013

100 rows × 7 columns

Writing a JOIN query

In order to extract only the relevant columns pertaining to our science question from multiple data tables, we may write a query that uses a JOIN statement. There are 4 main types of JOIN statements that we could use, and which one we decide to choose depends on how we want the information to be extracted.

  1. (INNER) JOIN: Returns rows that have matching values in both tables
  2. LEFT (OUTER) JOIN: Returns all rows from the left table, and the matched rows from the right table
  3. RIGHT (OUTER) JOIN: Returns all rows from the right table, and the matched rows from the left table
  4. FULL (OUTER) JOIN: Returns all rows when there is a match in either left or right table

Take a moment to look over the figure below outlining the various JOIN statement types.
NOTE: the default JOIN is an INNER JOIN.

JOIN LATERAL

In nearest neighbor crossmatch queries, we use JOIN LATERAL, which is like a SQL foreach loop that will iterate over each row in a result set and evaluate a subquery using that row as a parameter.

Joint query of LS and SDSS catalogs

Here we will examine spectroscopic redshifts from SDSS DR16 and photometry from LS DR8. The two crossmatch tables related to these two catalogs are: ls_dr8.x1p5__tractor__sdss_dr16__specobj and sdss_dr16.x1p5__specobj__ls_dr8__tractor_primary. The choice of which of these two crossmatch tables to use should be based on the science question being posed. For instance, the question 'how does a galaxy's structure change with redshift?' is dependent on the redshift values obtained from SDSS DR16, so we should use the crossmatch table that has SDSS DR16 as the first table. So, the relevant information we want from our 3 tables of interest for this example are:

  1. "X" = sdss_dr16.x1p5__specobj__ls_dr8__tractor_primary
    • ra1 (RA of sdss object)
    • dec1 (Dec of sdss object)
  2. "S" = sdss_dr16.specobj
    • z (redshift)
    • class (spectroscopic class: GALAXY, QSO, or STAR)
    • veldisp (velocity dispersion)
    • veldisperr (error in velocity dispersion)
  3. "L" = ls_dr8.tractor
    • ra (RA of ls object)
    • dec (Dec of ls object)
    • type (morphological model: PSF=stellar, REX=round exponential galaxy, DEV=deVauc, EXP=exponential, COMP=composite, DUP=Gaia source fit by different model)
    • g_r (computed g-r color)
    • r_z (computed r-z color)

Write the query

Now that we know what we want and where we want it from, let's write the query and then print the results on screen. Here we use two join statements: the first will search in the SDSS DR16 specobj table for rows that have the same SDSS id value (specobjid) as in the pre-crossmatched table (id1) and retrieve the desired columns from the SDSS DR16 specobj table. The second will search in the LS DR8 tractor table for rows that have the same LS id value (ls_id) as in the pre-crossmatched table (id2) and retrieve the desired columns from the LS DR8 tractor table.

In [6]:
query = ("""SELECT 
           X.ra1 as ra_sdss,X.dec1 as dec_sdss,
           S.z,S.class,S.veldisp,S.veldisperr,
           L.ra as ra_ls,L.dec as dec_ls,L.type,L.g_r,L.r_z
         FROM sdss_dr16.x1p5__specobj__ls_dr8__tractor_primary as X 
         JOIN sdss_dr16.specobj as S ON X.id1 = S.specobjid 
         JOIN ls_dr8.tractor_primary AS L ON X.id2 = L.ls_id
         WHERE X.ra1 BETWEEN %s and %s and X.dec1 BETWEEN %s and %s
         """) %(110,200,7.,40.)  #large region
print(query)
SELECT 
           X.ra1 as ra_sdss,X.dec1 as dec_sdss,
           S.z,S.class,S.veldisp,S.veldisperr,
           L.ra as ra_ls,L.dec as dec_ls,L.type,L.g_r,L.r_z
         FROM sdss_dr16.x1p5__specobj__ls_dr8__tractor_primary as X 
         JOIN sdss_dr16.specobj as S ON X.id1 = S.specobjid 
         JOIN ls_dr8.tractor_primary AS L ON X.id2 = L.ls_id
         WHERE X.ra1 BETWEEN 110 and 200 and X.dec1 BETWEEN 7.0 and 40.0
         
In [7]:
%%time
df = qc.query(sql=query,fmt='pandas')
df
CPU times: user 3.91 s, sys: 1.52 s, total: 5.43 s
Wall time: 1min 15s
Out[7]:
ra_sdss dec_sdss z class veldisp veldisperr ra_ls dec_ls type g_r r_z
0 123.73187 39.973827 1.763972 QSO 0.00000 0.000000 123.731867 39.973832 PSF 0.754528 0.498163
1 123.83651 39.987596 0.463355 GALAXY 236.51454 38.593452 123.836530 39.987605 DEV 1.963539 1.118027
2 123.81502 39.971574 0.067499 GALAXY 74.77858 12.133754 123.815039 39.971575 COMP 0.729049 0.522813
3 124.07318 39.746556 2.217832 QSO 0.00000 0.000000 124.073191 39.746567 PSF 0.152601 0.267639
4 123.12650 39.993317 0.124905 GALAXY 221.71207 10.674141 123.126449 39.993302 DEV 1.153234 0.772179
... ... ... ... ... ... ... ... ... ... ... ...
1122603 199.95811 39.775791 -0.000107 STAR 0.00000 0.000000 199.958081 39.775726 PSF 0.315632 0.075909
1122604 198.89367 39.929405 0.272973 GALAXY 175.36577 16.407667 198.893659 39.929375 DEV 1.322998 0.806646
1122605 199.98703 39.697369 0.188422 GALAXY 232.41400 14.471788 199.987008 39.697350 DEV 1.244230 0.775111
1122606 198.85961 39.951831 0.094873 GALAXY 118.25831 6.661118 198.859585 39.951808 DEV 0.815535 0.514042
1122607 198.95289 39.927247 0.182800 GALAXY 248.40268 18.373820 198.952876 39.927280 DEV 1.327106 0.778608

1122608 rows × 11 columns

Saving results to VOSpace

VOSpace is a convenient storage space for users to save their work. It can store any data or file type. We can save the results from the same query to our virtual storage space:

In [8]:
response = qc.query(sql=query,fmt='csv',out='vos://testresult.csv')

Let's ensure the file was saved in VOSpace:

In [9]:
sc.ls(name='vos://testresult.csv')
Out[9]:
'testresult.csv'

Now let's remove the file we just saved to VOSpace:

In [10]:
sc.rm (name='vos://testresult.csv')
Out[10]:
'OK'

Let's ensure the file was removed from VOSpace:

In [11]:
sc.rm (name='vos://testresult.csv')
Out[11]:
'A Node does not exist with the requested URI.'

Saving results to MyDB

MyDB is a useful OS remote per-user relational database that can store data tables. Furthermore, the results of queries can be directly saved to MyDB, as we show in the following example:

In [12]:
response = qc.query(sql=query, fmt='csv', out='mydb://testresult')

Ensure the table has been saved to MyDB by calling the mydb_list() function, which will list all tables currently in a user's MyDB:

In [13]:
print(qc.mydb_list(),"\n")
bgsfaint_dlnotebook
desi_tile
gaia_sample
gaia_sample_xmatch
gals
lowmassagn_dlnotebook
secondary_dark_subset
sv1targets_bright_secondary
sv1targets_dark_secondary
testresult
 

Now let's drop the table from our MyDB.

In [14]:
qc.mydb_drop('testresult')
Out[14]:
'OK'

Ensure it has been removed by calling the mydb_list() function again:

In [15]:
print(qc.mydb_list(),"\n")
bgsfaint_dlnotebook
desi_tile
gaia_sample
gaia_sample_xmatch
gals
lowmassagn_dlnotebook
secondary_dark_subset
sv1targets_bright_secondary
sv1targets_dark_secondary
 

Crossmatch a user-provided data table and a pre-crossmatched table

We can construct a query to run a crossmatch in the database using the q3c_join() function, which identifies all matching objects within a specified radius in degrees (see details on using Q3C functions). For this example, we will search only for the single nearest neighbor. For different examples of crossmatching, see our How to crossmatch tables notebook.

First, let's query a small selection of sample data from the Data Lab database and store it in MyDB as gaia_sample. This will act as our "user-provided table".

In [16]:
query = """SELECT source_id,ra,dec,parallax,pmra,pmdec 
            FROM gaia_dr2.gaia_source 
            WHERE ra<200 AND ra>124 AND random_id<10 
            LIMIT 70000"""
print(query)
SELECT source_id,ra,dec,parallax,pmra,pmdec 
            FROM gaia_dr2.gaia_source 
            WHERE ra<200 AND ra>124 AND random_id<10 
            LIMIT 70000
In [17]:
%%time
response = qc.query(sql=query,out='mydb://gaia_sample',drop=True)
CPU times: user 26.7 ms, sys: 1.06 ms, total: 27.8 ms
Wall time: 1min 57s

Write a crossmatch query

Next let's crossmatch our gaia_sample table with Data Lab's pre-crossmatched table between SMASH DR2 and allWISE smash_dr2.x1p5__object__allwise__source. We'll write our crossmatch query using the q3c_join() function as well as the q3c_dist() function, searching for the nearest neighbor within a 1.5 arcsec radius (which must be converted into degrees for the query, so we divide by 3600.0). We will then save it in MyDB as gaia_sample_xmatch.

In [18]:
%%time
qu = """SELECT
        G.source_id,ss.id1,ss.id2,G.ra,G.dec,ss.ra1,ss.dec1,ss.ra2,ss.dec2,
        (q3c_dist(G.ra,G.dec,ss.ra1,ss.dec1)*3600.0) as dist_arcsec
        FROM mydb://gaia_sample AS G
        JOIN LATERAL (
            SELECT S.id1,S.id2,S.ra1,S.dec1,S.ra2,S.dec2
            FROM 
                smash_dr2.x1p5__object__allwise__source AS S
            WHERE 
                q3c_join(G.ra,G.dec,S.ra1,S.dec1, 1.5/3600.0)
            ORDER BY
                q3c_dist(G.ra,G.dec,S.ra1,S.dec1)
            ASC LIMIT 1
            ) AS ss ON true
    """
resp = qc.query(sql=qu,out='mydb://gaia_sample_xmatch',drop=True)
CPU times: user 26.9 ms, sys: 1.69 ms, total: 28.6 ms
Wall time: 1.38 s

We can query the newly created table from MyDB and convert it into a Pandas Dataframe object in order to print it on screen:

In [19]:
query = "SELECT * FROM mydb://gaia_sample_xmatch"
df = qc.query(sql=query,fmt='pandas')
df
Out[19]:
source_id id1 id2 ra dec ra1 dec1 ra2 dec2 dist_arcsec
0 5205696522501120512 Field80.880606 1482m743_ac51-023387 150.869988 -74.496170 150.869990 -74.496177 150.869098 -74.496177 0.023617
1 5205674807146027008 Field80.61312 1535m743_ac51-011624 151.032199 -74.745129 151.032198 -74.745131 151.032752 -74.745275 0.008880
2 6141410299609363840 Field127.835819 1993m394_ac51-059307 199.455692 -38.742563 199.455696 -38.742564 199.455683 -38.742551 0.010309
3 5467823633613278720 Field85.328869 1577m288_ac51-061337 157.482148 -28.132586 157.482144 -28.132586 157.482181 -28.132587 0.013867
4 5659712743651975424 Field76.497391 1486m243_ac51-024709 149.263676 -24.547096 149.263677 -24.547099 149.263713 -24.547078 0.011663
... ... ... ... ... ... ... ... ... ... ...
637 5388501192592385536 Field91.325126 1638m425_ac51-015781 163.258046 -42.610578 163.258049 -42.610577 163.258049 -42.610625 0.009637
638 5459798298242774016 Field77.1011912 1493m334_ac51-048226 149.796546 -33.028633 149.796547 -33.028635 149.796653 -33.028629 0.005438
639 6152131053375898496 Field123.257555 1918m409_ac51-043838 191.736605 -40.442602 191.736609 -40.442599 191.736650 -40.442600 0.016839
640 5441173017945297920 Field163.329494 1591m379_ac51-024901 159.794886 -38.270692 159.794889 -38.270695 159.794912 -38.270608 0.012901
641 5390277174450667520 Field165.440845 1661m409_ac51-015490 165.908202 -41.241788 165.908207 -41.241784 165.908317 -41.241735 0.017132

642 rows × 10 columns

Write the joint query

Now we can write a query using the JOIN statement in order to extract the columns we want from our tables of interest. Just as in the previous section, let's first make an outline of which tables we'd like to extract columns from.

  1. "X" = mydb://gaia_sample_xmatch
    • source_id (source id from gaia dr2)
    • id1 (source id from smash dr1)
    • id2 (source id from allwise)
    • ra (RA value from gaia dr2)
    • dec (Dec value from gaia dr2)
  2. "s" = smash_dr2.object
    • gmag (weighted-avarage, calibrated g-band magnitude, 99.99 if no detection)
    • rmag (weighted-avarage, calibrated r-band magnitude, 99.99 if no detection)
    • zmag (weighted-avarege, calibrated z-band magnitude, 99.99 if no detection)
  3. "a" = allwise.source
    • w1mpro (W1 magnitude measured with profile-fitting photometry)
    • w2mpro (W2 magnitude measured with profile-fitting photometry)
    • w3mpro (W3 magnitude measured with profile-fitting photometry)
  4. "g" = mydb://gaia_sample
    • parallax
    • pmra (proper motion in right ascension direction)
    • pmdec (proper motion in declination direction)
In [20]:
query = ("""SELECT 
           X.source_id,X.id1,X.id2,X.ra,X.dec,
           s.gmag,s.rmag,s.zmag,
           a.w1mpro,a.w2mpro,a.w3mpro,
           g.parallax,g.pmra,g.pmdec
         FROM mydb://gaia_sample_xmatch as X 
         JOIN smash_dr2.object as s ON X.id1 = s.id 
         JOIN allwise.source AS a ON X.id2 = a.source_id
         JOIN mydb://gaia_sample AS g ON X.source_id = g.source_id
         """)
print(query)
SELECT 
           X.source_id,X.id1,X.id2,X.ra,X.dec,
           s.gmag,s.rmag,s.zmag,
           a.w1mpro,a.w2mpro,a.w3mpro,
           g.parallax,g.pmra,g.pmdec
         FROM mydb://gaia_sample_xmatch as X 
         JOIN smash_dr2.object as s ON X.id1 = s.id 
         JOIN allwise.source AS a ON X.id2 = a.source_id
         JOIN mydb://gaia_sample AS g ON X.source_id = g.source_id
         
In [21]:
df = qc.query(sql=query,fmt='pandas')
df
Out[21]:
source_id id1 id2 ra dec gmag rmag zmag w1mpro w2mpro w3mpro parallax pmra pmdec
0 5205696522501120512 Field80.880606 1482m743_ac51-023387 150.869988 -74.496170 20.678373 19.158497 18.168558 16.026 16.102 13.308 0.551829 -9.752906 4.869220
1 5205674807146027008 Field80.61312 1535m743_ac51-011624 151.032199 -74.745129 21.323969 19.730803 18.502588 16.329 17.933 13.250 0.858763 -6.702968 1.961605
2 6141410299609363840 Field127.835819 1993m394_ac51-059307 199.455692 -38.742563 15.066941 99.990000 99.990000 12.891 12.946 12.681 0.726484 -8.683367 -6.934360
3 5467823633613278720 Field85.328869 1577m288_ac51-061337 157.482148 -28.132586 17.074688 16.452244 16.070230 14.428 14.616 12.749 0.602111 -1.884323 -1.545751
4 5659712743651975424 Field76.497391 1486m243_ac51-024709 149.263676 -24.547096 18.163761 17.786285 17.640320 16.565 17.321 12.651 0.201747 -6.185929 2.255060
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
637 5388501192592385536 Field91.325126 1638m425_ac51-015781 163.258046 -42.610578 16.392881 15.905405 15.651100 14.543 14.597 12.921 0.459421 -6.255805 0.402247
638 5459798298242774016 Field77.1011912 1493m334_ac51-048226 149.796546 -33.028633 22.306787 20.748846 18.740670 16.427 15.961 12.290 0.869519 -0.222829 2.052986
639 6152131053375898496 Field123.257555 1918m409_ac51-043838 191.736605 -40.442602 16.014399 15.266039 14.703455 12.854 12.924 12.413 0.175101 -7.460469 -1.794713
640 5441173017945297920 Field163.329494 1591m379_ac51-024901 159.794886 -38.270692 20.684067 19.644320 19.017138 17.330 17.485 12.472 0.660549 -11.618653 2.288410
641 5390277174450667520 Field165.440845 1661m409_ac51-015490 165.908202 -41.241788 17.877280 17.364906 17.069042 15.736 16.608 12.337 0.178047 -5.586176 1.011907

642 rows × 14 columns

Speed test

Here we compare the speed of using the q3c_join() function to crossmatch directly in a JOIN query (query1) versus using a pre-crossmatched table in a JOIN query (query2). We select objects from the two catalogs and retrieve the same specified columns for the two queries. We will see that query1 times out after 300 seconds (5 minutes) and fails to retrieve results, while query2 runs for about 60-90 seconds (1-1.5 minutes) and will retrieve the 3.6 million rows we queried for.

First, running the crossmatch ourselves:

In [22]:
%%time
query1 = """SELECT
           a.source_id as id1,a.ra as ra1,a.dec as dec1,a.pmdec,a.pmra,a.w1mpro,a.w2mpro,
           gg.specobjid as id2,gg.ra as ra2,gg.dec as dec2,gg.z,gg.class,gg.veldisp,gg.veldisperr,
           (q3c_dist(a.ra,a.dec,gg.ra,gg.dec)*3600.0) as dist_arcsec 
         FROM 
            allwise.source AS a
         INNER JOIN LATERAL (
            SELECT s.specobjid,s.ra,s.dec,s.z,s.class,s.veldisp,s.veldisperr
            FROM 
                sdss_dr16.specobj AS s
            WHERE
                q3c_join(a.ra, a.dec, s.ra, s.dec, 1.5/3600.0)
            ORDER BY
                random()
            ASC LIMIT 1
        ) as gg ON true
"""
df1 = qc.query(sql=query1,timeout=300,fmt='pandas')
df1
---------------------------------------------------------------------------
queryClientError                          Traceback (most recent call last)
<timed exec> in <module>

/data0/sw/anaconda3/lib/python3.7/site-packages/noaodatalab-2.19.1-py3.7.egg/dl/Util.py in __call__(self, *args, **kw)
     80             return function(self.obj, *args, **kw)
     81         else:
---> 82             return function(*args, **kw)
     83 
     84     def __repr__(self):

/data0/sw/anaconda3/lib/python3.7/site-packages/noaodatalab-2.19.1-py3.7.egg/dl/queryClient.py in query(token, adql, sql, fmt, out, async_, drop, profile, **kw)
    542     return qc_client._query (token=def_token(token), adql=adql, sql=sql, 
    543                              fmt=fmt, out=out, async_=async_, drop=drop, profile=profile,
--> 544                              **kw)
    545 
    546 

/data0/sw/anaconda3/lib/python3.7/site-packages/noaodatalab-2.19.1-py3.7.egg/dl/queryClient.py in _query(self, token, adql, sql, fmt, out, async_, drop, profile, **kw)
   2028         r = requests.get (dburl, headers=headers, timeout=timeout)
   2029         if r.status_code != 200:
-> 2030             raise queryClientError (r.text)
   2031         resp = qcToString(r.content)
   2032 

queryClientError: Error: QM: Query timeout at 300 sec

Now, the same but using pre-crossmatched tables:

In [23]:
%%time
query2 = """SELECT 
           X.id1,X.id2,X.ra1,X.dec1,X.ra2,X.dec2,X.distance as dist_arcsec,
           a.pmdec,a.pmra,a.w1mpro,a.w2mpro,
           s.z,s.class,s.veldisp,s.veldisperr
         FROM 
             allwise.x1p5__source__sdss_dr16__specobj as X 
         JOIN 
             allwise.source as a ON X.id1 = a.source_id 
         JOIN 
             sdss_dr16.specobj AS s ON X.id2 = s.specobjid
         """
df2 = qc.query(sql=query2,fmt='pandas')
df2
CPU times: user 17.9 s, sys: 7.66 s, total: 25.5 s
Wall time: 4min 14s
Out[23]:
id1 id2 ra1 dec1 ra2 dec2 dist_arcsec pmdec pmra w1mpro w2mpro z class veldisp veldisperr
0 1466p560_ac51-033686 -9223371638696554496 146.373125 56.176027 146.37317 56.175996 0.000040 27.0 48.0 15.639 14.821 0.905140 QSO 0.00000 0.000000
1 1466p560_ac51-015830 -9223370539184926720 146.091557 55.865180 146.09155 55.865082 0.000098 79.0 77.0 15.913 15.520 1.606482 GALAXY 0.00000 -4.000000
2 1466p560_ac51-015998 -9223368340161671168 146.280785 55.577186 146.28085 55.577054 0.000137 -854.0 -613.0 16.442 16.633 1.020406 GALAXY 850.00000 -3.000000
3 1466p560_ac51-015929 -9223367790405857280 146.294858 55.672178 146.29491 55.672121 0.000064 104.0 -163.0 16.202 16.305 0.669265 GALAXY 75.59966 39.807648
4 1466p560_ac51-015798 -9223367515527950336 146.111274 55.681602 146.11125 55.681554 0.000050 -15.0 81.0 16.105 14.801 1.466512 QSO 0.00000 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3620144 1467p575_ac51-012831 9222401841828220928 145.836112 57.248444 145.83593 57.248505 0.000116 -190.0 490.0 16.499 15.350 1.111875 QSO 0.00000 0.000000
3620145 1467p575_ac51-012735 9222402666461941760 145.768396 57.335220 145.76845 57.335061 0.000162 -282.0 264.0 16.121 16.448 0.544433 GALAXY 186.33210 49.277443
3620146 1467p575_ac51-015524 9222402941339848704 145.926687 57.212789 145.92668 57.212863 0.000074 -302.0 649.0 16.208 15.954 -0.000130 STAR 0.00000 0.000000
3620147 1467p575_ac51-012583 9222403216217755648 145.790488 57.314825 145.79047 57.314682 0.000143 -593.0 212.0 15.423 15.428 0.733144 GALAXY 284.18698 63.456600
3620148 1467p575_ac51-015457 9222403491095662592 145.944502 57.175794 145.94454 57.175825 0.000037 139.0 418.0 15.999 14.704 2.104238 QSO 0.00000 0.000000

3620149 rows × 15 columns

For completeness, we switch the order of the queries and query from a different catalog.

We again select objects from two catalogs and retrieve the same specified columns for two queries. query3 uses a pre-crossmatched table in a JOIN query and query4 crossmatches directly in the JOIN query. We will see that query3 runs for about 60-90 seconds (1-1.5 minutes) and will retrieve the 4.4 million rows we queried for, while query4 times out after 300 seconds (5 minutes) and fails to retrieve results.

First, using pre-crossmatched tables:

In [24]:
%%time
query3 = """SELECT 
           X.id1,X.id2,X.ra1,X.dec1,X.ra2,X.dec2,X.distance as dist_arcsec,
           u.mag_w1_vg,u.mag_w2_vg,s.z,s.class,s.veldisp,s.veldisperr
         FROM 
             unwise_dr1.x1p5__object__sdss_dr16__specobj as X 
         JOIN 
             unwise_dr1.object as u ON X.id1 = u.unwise_objid 
         JOIN 
             sdss_dr16.specobj AS s ON X.id2 = s.specobjid
         ORDER BY 
             random()
         """
df3 = qc.query(sql=query3,fmt='pandas')
df3
CPU times: user 22.1 s, sys: 8.34 s, total: 30.4 s
Wall time: 5min 32s
Out[24]:
id1 id2 ra1 dec1 ra2 dec2 dist_arcsec mag_w1_vg mag_w2_vg z class veldisp veldisperr
0 1484p620o0015328 2683195188741957632 149.053418 62.203977 149.05339 62.203976 0.000013 14.523033 14.611855 -0.000004 STAR 0.000000 0.000000
1 1395p181o0001707 2743991310615603200 139.260480 17.507336 139.26043 17.507311 0.000054 13.554090 13.554952 0.026897 GALAXY 61.602707 8.303150
2 1483p408o0015910 5219936775288672256 149.065872 41.115102 149.06588 41.115090 0.000013 15.619832 15.648850 0.587233 GALAXY 258.225650 63.808860
3 1957p484o0005714 7504150748982431744 195.495461 47.970752 195.49529 47.970955 0.000233 15.389571 15.274837 0.575522 GALAXY 352.117700 123.904495
4 1497p000o0005940 301805523861268480 149.556266 -0.386921 149.55621 -0.386684 0.000243 15.242478 14.920922 0.081059 GALAXY 76.905030 19.077383
... ... ... ... ... ... ... ... ... ... ... ... ... ...
4374677 1531p227o0040183 -7620946682152988672 153.079834 22.988416 153.07981 22.988422 0.000023 17.118313 inf 0.779773 GALAXY 804.446300 262.154200
4374678 1997p408o0023456 -9001445659641622528 200.278958 41.432067 200.27893 41.432007 0.000063 14.537611 14.393454 0.201212 GALAXY 142.949700 9.106541
4374679 1200p272o0016295 1046098770590197760 120.488696 27.439264 120.48870 27.439283 0.000019 12.576180 12.542265 0.076992 GALAXY 262.740780 7.050733
4374680 1904p499o0027799 3291094014941489152 191.578573 50.699750 191.57838 50.699834 0.000148 13.355895 13.306557 -0.000167 STAR 0.000000 0.000000
4374681 3463m016o0016960 4910209295633895424 346.345288 -1.142569 346.34526 -1.142585 0.000032 15.530380 15.188126 0.651222 QSO 0.000000 0.000000

4374682 rows × 13 columns

Now, running the crossmatch ourselves:

In [25]:
%%time
query4 = """SELECT
           u.unwise_objid as id1,u.ra as ra1,u.dec as dec1,u.mag_w1_vg,u.mag_w2_vg,
           ss.specobjid as id2,ss.ra as ra2,ss.dec as dec2,ss.z,ss.class,ss.veldisp,ss.veldisperr,
           (q3c_dist(u.ra,u.dec,ss.ra,ss.dec)*3600.0) as dist_arcsec 
         FROM 
            unwise_dr1.object AS u
         INNER JOIN LATERAL (
            SELECT s.specobjid,s.ra,s.dec,s.z,s.class,s.veldisp,s.veldisperr
            FROM 
                sdss_dr16.specobj AS s
            WHERE
                q3c_join(u.ra, u.dec, s.ra, s.dec, 1.5/3600.0)
            ORDER BY
                random()
            ASC LIMIT 1
        ) as ss ON true
"""
df4 = qc.query(sql=query4,fmt='pandas')
df4
---------------------------------------------------------------------------
queryClientError                          Traceback (most recent call last)
<timed exec> in <module>

/data0/sw/anaconda3/lib/python3.7/site-packages/noaodatalab-2.19.1-py3.7.egg/dl/Util.py in __call__(self, *args, **kw)
     80             return function(self.obj, *args, **kw)
     81         else:
---> 82             return function(*args, **kw)
     83 
     84     def __repr__(self):

/data0/sw/anaconda3/lib/python3.7/site-packages/noaodatalab-2.19.1-py3.7.egg/dl/queryClient.py in query(token, adql, sql, fmt, out, async_, drop, profile, **kw)
    542     return qc_client._query (token=def_token(token), adql=adql, sql=sql, 
    543                              fmt=fmt, out=out, async_=async_, drop=drop, profile=profile,
--> 544                              **kw)
    545 
    546 

/data0/sw/anaconda3/lib/python3.7/site-packages/noaodatalab-2.19.1-py3.7.egg/dl/queryClient.py in _query(self, token, adql, sql, fmt, out, async_, drop, profile, **kw)
   2028         r = requests.get (dburl, headers=headers, timeout=timeout)
   2029         if r.status_code != 200:
-> 2030             raise queryClientError (r.text)
   2031         resp = qcToString(r.content)
   2032 

queryClientError: Error: QM: Query timeout at 300 sec

Appendix

A clear benefit of pre-crossmatched tables is that they contain the positions of the same objects in two datasets. We can use this to e.g. fetch images of an object from both surveys.

A1. unWISE DR1 vs LS DR8

Here we will compare two images of the same object from two different catalogs, unWISE DR1 and LS DR8.

Function to retrieve cutouts

In [26]:
def make_cutout_comparison_table(ra_in1, dec_in1, layer1, layer2, pixscale, ra_in2=None, dec_in2=None):
    """
    Obtain color JPEG images from Legacy Survey team cutout tool at NERSC
    """    
    img1 = []
    img2 = []
    
    for i in range(len(ra_in1)):
        cutout_url1 = "https://www.legacysurvey.org/viewer/cutout.jpg?ra=%g&dec=%g&layer=%s&pixscale=%s" % (ra_in1[i],dec_in1[i],layer1,pixscale)
        img = plt.imread(download_file(cutout_url1,cache=True,show_progress=False,timeout=120))
        img1.append(img)
        
        cutout_url2 = "https://www.legacysurvey.org/viewer/cutout.jpg?ra=%g&dec=%g&layer=%s&pixscale=%s" % (ra_in2[i],dec_in2[i],layer2,pixscale)
        img = plt.imread(download_file(cutout_url2,cache=True,show_progress=False,timeout=120))
        img2.append(img)

    return img1,img2

Function to generate plots

In [27]:
def plot_cutouts(img1,img2,cat1,cat2):
    """
    Plot images in two rows with 5 images in each row
    """
    fig = plt.figure(figsize=(21,7))

    for i in range(len(img1)):
        ax = fig.add_subplot(2,6,i+1)
        ax.imshow(img1[i])
        ax.xaxis.set_major_formatter(NullFormatter())
        ax.yaxis.set_major_formatter(NullFormatter())
        ax.tick_params(axis='both',which='both',length=0)
        ax.text(0.02,0.93,'ra=%.5f'%list_ra1[i],transform=ax.transAxes,fontsize=12,color='white')
        ax.text(0.02,0.85,'dec=%.5f'%list_dec1[i],transform=ax.transAxes,fontsize=12,color='white')
        ax.text(0.02,0.77,cat1,transform=ax.transAxes,fontsize=12,color='white')

        ax = fig.add_subplot(2,6,i+7)
        ax.imshow(img2[i])
        ax.xaxis.set_major_formatter(NullFormatter())
        ax.yaxis.set_major_formatter(NullFormatter())
        ax.tick_params(axis='both',which='both',length=0)
        ax.text(0.02,0.93,'ra=%.5f'%list_ra2[i],transform=ax.transAxes,fontsize=12,color='white')
        ax.text(0.02,0.85,'dec=%.5f'%list_dec2[i],transform=ax.transAxes,fontsize=12,color='white')
        ax.text(0.02,0.77,cat2,transform=ax.transAxes,fontsize=12,color='white')

    plt.subplots_adjust(wspace=0.02, hspace=0.03)

Write query to randomly select five targets (RA/Dec positions) from unWISE DR1 and LS DR8 crossmatch table

... then save them as arrays and set the captions, layers, and pixscale. Finally we plot the cutout images.

In [28]:
%%time
q = """SELECT ra1,dec1,ra2,dec2 
        FROM unwise_dr1.x1p5__object__ls_dr8__tractor_primary 
        WHERE ra1>300 AND dec1>33 
        ORDER BY random() 
        LIMIT 5"""

r = qc.query(sql=q,fmt='pandas')

list_ra1=r['ra1'].values       # ".values" convert to numpy array
list_dec1=r['dec1'].values
list_ra2=r['ra2'].values       
list_dec2=r['dec2'].values

cat1='unWISE'
cat2='ls dr8'
layer1='unwise-neo6'
layer2='ls-dr8'
pixscale='0.3'
img1,img2 = make_cutout_comparison_table(list_ra1,list_dec1,layer1,layer2,
                                         pixscale,list_ra2,list_dec2)
plot_cutouts(img1,img2,cat1,cat2)
CPU times: user 552 ms, sys: 68.5 ms, total: 620 ms
Wall time: 48.2 s

A2. SDSS vs DES DR1

Here we will compare two images of the same object from two different catalogs, SDSS and DES DR1.

Write query to randomly select five targets (RA/Dec positions) from SDSS DR16 and DES DR1 crossmatch table

... then save them as arrays and set the captions, layers, and pixscale. Finally we plot the cutout images.

In [29]:
%%time
q = """SELECT ra1,dec1,ra2,dec2 
        FROM sdss_dr16.x1p5__specobj__des_dr1__main 
        ORDER BY random() 
        LIMIT 5"""

r = qc.query(sql=q,fmt='pandas')

list_ra1=r['ra1'].values       # ".values" convert to numpy array
list_dec1=r['dec1'].values
list_ra2=r['ra2'].values       
list_dec2=r['dec2'].values

cat1='sdss dr16'
cat2='des dr1'
layer1='sdss'
layer2='des-dr1'
pixscale='0.25'
img1,img2 = make_cutout_comparison_table(list_ra1,list_dec1,layer1,layer2,
                                         pixscale,list_ra2,list_dec2)
plot_cutouts(img1,img2,cat1,cat2)
CPU times: user 487 ms, sys: 27.5 ms, total: 515 ms
Wall time: 9.48 s

A3. Cool galaxy finds: SDSS vs DES DR1

We compare two images of the same galaxy from two different catalogs, SDSS and DES DR1. We use a list of identified galaxies (RA/Dec positions) to compare the difference in observable features and quality between the two catalogs.

First we import the CSV file of identified galaxies (RA/Dec positions) into MyDB:

In [30]:
qc.mydb_import('gals','./gals.csv',drop=True)
Out[30]:
'OK'

We write the query to select the first five RA/Dec positions from our table. We then save them as arrays and set the captions, layers, and pixscale. Finally we plot the cutout images.

In [31]:
qg = """SELECT ra,dec 
        FROM mydb://gals 
        LIMIT 5"""
rg = qc.query(sql=qg)
rp = convert(rg)
list_ra1=rp['ra'].values 
list_dec1=rp['dec'].values
list_ra2=rp['ra'].values
list_dec2=rp['dec'].values

cat1='sdss dr16'
cat2='des dr1'
layer1='sdss'
layer2='des-dr1'
pixscale='0.5'

img1,img2 = make_cutout_comparison_table(list_ra1,list_dec1,layer1,layer2,
                                        pixscale,ra_in2=list_ra1,dec_in2=list_dec1)
plot_cutouts(img1,img2,cat1,cat2)

We write the next query to select the next five RA/Dec positions from our table. We then save them as arrays and set the captions, layers, and pixscale. Finally we plot the cutout images.

In [32]:
qg = """SELECT ra,dec 
        FROM mydb://gals 
        LIMIT 5 
        OFFSET 5"""
rg = qc.query(sql=qg)
rp = convert(rg)
list_ra1=rp['ra'].values      
list_dec1=rp['dec'].values
list_ra2=rp['ra'].values     
list_dec2=rp['dec'].values

img1,img2 = make_cutout_comparison_table(list_ra1,list_dec1,layer1,layer2,
                                        pixscale,ra_in2=list_ra1,dec_in2=list_dec1)
plot_cutouts(img1,img2,cat1,cat2)

We write the next query to select the last five RA/Dec positions from our table. We then save them as arrays and set the captions, layers, and pixscale. Finally we plot the cutout images.

In [33]:
qg = """SELECT ra,dec 
        FROM mydb://gals 
        LIMIT 5 
        OFFSET 10"""
rg = qc.query(sql=qg)
rp = convert(rg)
list_ra1=rp['ra'].values    
list_dec1=rp['dec'].values
list_ra2=rp['ra'].values     
list_dec2=rp['dec'].values

img1,img2 = make_cutout_comparison_table(list_ra1,list_dec1,layer1,layer2,
                                        pixscale,ra_in2=list_ra1,dec_in2=list_dec1)
plot_cutouts(img1,img2,cat1,cat2)

Resources & references